- ◊ data-driven learning applied to learning Arabic
- ◊ concordance-based tools are better done offline (→ Godot), because they often need large amounts of data...
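
To make the "concordance" idea concrete, here is a minimal keyword-in-context (KWIC) sketch. It is only an illustration, not a finished tool: the names `strip_diacritics` and `kwic` are made up for this note, tokenization is assumed to be plain whitespace splitting, and matching is assumed to ignore Arabic short-vowel marks so that vocalized and unvocalized spellings hit the same entry.

```python
import re

# Assumption: these are the marks to ignore when matching.
# U+064B..U+0652 are the Arabic short-vowel/shadda/sukun marks,
# U+0670 is the superscript alef, U+0640 is the tatweel (elongation) character.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text):
    """Remove Arabic diacritics and tatweel so matching ignores vocalization."""
    return _DIACRITICS.sub("", text)


def kwic(text, keyword, width=3):
    """Return (left context, matched token, right context) for every hit.

    Tokens are split on whitespace only (hypothetical simplification);
    `width` is the number of tokens kept on each side of the hit.
    """
    tokens = text.split()
    target = strip_diacritics(keyword)
    hits = []
    for i, tok in enumerate(tokens):
        if strip_diacritics(tok) == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits
```

Something like `kwic(corpus_text, "كتاب", width=5)` would then list every occurrence with five tokens of context on each side. Punctuation stuck to a token is not handled here, and a real corpus would probably want an index rather than a linear scan, which is where the large-data concern comes in.
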
Possible Sources
- Sinai Corpus: the first thing you find, but it has a weird encoding and I'm not sure how to use it (sparsely documented)
- https://github.com/linuxscout/tashkeela2/blob/master/data/Interviews/Int07.xml seems like a decent amount of data (diverse topics), but it's spread across folders and XML files, which would need cleaning and merging
- 1800 Tweets: Jordanian or MSA, but in xlsx
- Arabic Wikipedia: comes with a dump, a corpus, and instructions
- undocumented zip corpus
- Arabic Big Corpus: not actually very big, and may be exclusively Quran
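
For the sources that are spread across folders and XML files, the merging step could be sketched roughly as below. This is a guess at a workable approach, not something tested against the actual repository: the function name `merge_xml_text` is hypothetical, the XML schema is unknown, so the sketch simply collects all text nodes from each file, and malformed files are skipped rather than repaired.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def merge_xml_text(root_dir, out_path):
    """Write the text of every .xml file under root_dir into one UTF-8 file.

    One line per source file. Returns the number of files merged.
    Assumption: all text nodes are wanted, whatever the element names are.
    """
    merged = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # rglob walks all subfolders; sorted() makes the output order stable
        for xml_file in sorted(Path(root_dir).rglob("*.xml")):
            try:
                root = ET.parse(xml_file).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed XML
            # Join all text nodes, then collapse runs of whitespace
            text = " ".join(" ".join(root.itertext()).split())
            if text:
                out.write(text + "\n")
                merged += 1
    return merged
```

Calling `merge_xml_text("tashkeela2/data", "merged.txt")` on a local clone (path assumed) would give a single plain-text file to feed into a concordance tool. Any real cleaning, such as dropping speaker labels or metadata elements, would depend on the schema and is left out here.
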